세계 행복지수 데이타 회귀분석

Overflow: 차유진, 정진우, 현경근, 이경아

변수 해석

종속변수

Life Ladder: 설문 질문 “10이 누릴수 있는 최고의 삶이고, 0이 누릴수 있는 최저의 삶이라면, 당신의 현재 삶은 몇점인가?”의 나라별 평균 점수 출처: Gallup World Poll

독립변수

GDP (국내총생산): 2011년 국제 달러 가격으로 계산한 나라별 일년동안 생산활동 출처: World Development Indicators(WDI) (20170915)

Healthy Life Expectancy (건강 기대 수명): 사람의 연령별 사망률, 이환률, 기능적 건강 상태를 고려한 좋은 건강 수준에서 살아갈 수 있을 것으로 기대되는 나라별 평균 나이 출처: World Health Organization (WHO), WDI, 여러 논문의 통계를 통해서, Global Happiness Council에서 계산

Social Support: “필요하면 언제든 힘든 상황일때, 기댈수 있는 친척이나 친구분들이 있나요?” 에 관한 답변 (0 또는 1)에 국가별 평균 출처: Gallup World Poll

Freedom to make life choices: “지금 현재 본인의 삶의 선택의 자유성에 관해서 만족하나요 불만족하나요?” 에 관한 답변 (0 또는 1)에 국가별 평균 출처: Gallup World Poll

Generosity: “지난 한달 동안 자선 단체에 기부하신적이 있나요?”에 관한 국가별 평균 답변을, GDP per capita에 회귀한 잔차 출처: Gallup World Poll, World Happiness Report

Corruption Perception: “정부에 부패가 널리 퍼져있나요?”과 “사업내 비리가 널리 퍼져있나요?”에 관한 답변 (0 또는 1)의 국가별 평균 출처: Gallup World Poll

Positive Affect: “어저께 행복한 감정을 많이 느끼셨나요?”과 “어저께 웃거나 미소를 많이 지으셨나요?”과 “어저께 만족감을 많이 느끼셨나요?” 에 관한 답변 (0 또는 1)의 국가별 평균 출처: Gallup World Poll

Negative Affect: “어저께 걱정을 많이 하셨나요?”과 “어저께 슬픈 감정들을 많이 느끼셨나요?”과 “어저께 화를 많이 느끼셨나요?”의 답변 (각 0 또는 1)의 국가별 평균 출처: Gallup World Poll

Migrant Acceptance Index (거주 이민 수용 수치): 다음 문장에 대해서 개인적으로 좋게 생각하나요 나쁘게 생각하나요? “이민들이 이 나라에서 거주하다” “이민들이 나의 이웃이 되다” “이민들의 가까운 친척과 결혼을 하다” “좋은것이다”는 3점, “상황마다 틀리다” 또는 “잘 모르겠다”는 1 점, “나쁜것이다”는 0 점. 수치는 3가지 답변에 총 점수이다. 세계 평균 MAI는 5.29이다 출처: Gallup World Poll

Democratic Quality WGI에서 다양한 원천 데이터를 통해 만든 Voice and Accountability (VA) 과 Political Stability and Absence of Violence/Terrorism (PV)라는 종합수치의 단순 평균. u = 0, sd = 1, -2.5 에서 2.5의 수치

VA: 국가의 시민들이 정부를 선택 할수있는 자유, 표현의 자유, 단체 결사의 자유, 밑 언론의 자유를 통합한 수치

PV: 정부가 헌법위반이나 폭력적인 수단 (정치적 폭력또는 테러)으로 인해서 와해될 가능성 출처: World Governance Indicator, Global Happiness Council

Delivery Quality WGI에서 다양한 원천 데이터를 통해 만든 Government Effectiveness(GE), Regulatory Quality (RQ), Rule of Law (RL), Control of Corruption (CC)라는 종합수치의 단순 평균. u = 0, sd = 1, -2.5 에서 2.5의 수치

GE: 공공서비스, 정부 업무, 정치적 압박으로부터 독립성, 정책형성, 정책집행, 정책에 관한 정부의 확약에 관한 신뢰성

RQ: 민간부문 발전을 위한 정부의 능력에 관한 인식

RL: 사회규정, 계약 시행, 재산권리, 경찰, 법정들의 관한 신뢰성

CC: 개인 이익을 위한 공권력 사용 또는 정부 부패 출처: World Governance Indicator, Global Happiness Council

데이타 핸들링 / 클리닝

# suppressing warning, the warnings have been notified by our team, but was not a problem in the on overall analysis
options(warn=-1)
# importing libraries to local memory
library(pacman)
pacman::p_load(readr,dplyr,gvlma,data.table,car,ggplot2,lm.beta,
               corrplot,dplyr,MASS,Hmisc,r2glmm,olsrr,DAAG,highcharter)

# Import data
df <- data.table::fread(paste0('https://raw.githubusercontent.com/Jinwooooo/',
                               'fcdc-overflow-project-1/master/data/raw-data.csv'))
df$V1 <- NULL

# Remove white space in column names
reset.col <- c('country','year','life.ladder','log.gdp.per.cap','social.support',
               'healthy.life.expectancy.at.birth','freedom.to.make.life.choices','generosity',
               'perception.of.corruption','positive.affect','negative.affect','confidence.in.govt',
               'democratic.quality','delivery.quality','sd.of.ladder.by.country',
               'sd.mean.ladder.by.country','gini.index','gini.index.past.mean',
               'gini.house.hold.income')
colnames(df) <- reset.col

# Removing unused columns and filling na values with mean values (in order to actually use values as predict at the end)
df.main <- df %>%
  dplyr::select(-c(sd.of.ladder.by.country, sd.mean.ladder.by.country)) %>% 
  dplyr::group_by(year) %>% 
  dplyr::mutate_at(vars(-country),funs(replace(., which(is.nan(.)), mean(., na.rm = TRUE)))) %>%
  dplyr::mutate_at(vars(-country), funs(replace(., which(is.na(.)), mean(., na.rm = TRUE))))

df.main <- df.main %>% 
  dplyr::group_by(country) %>%
  dplyr::mutate_all(funs(replace(., which(is.nan(.)), mean(., na.rm = TRUE)))) %>% 
  dplyr::mutate_all(funs(replace(., which(is.na(.)), mean(., na.rm = TRUE))))

df.main$gini.house.hold.income <- Hmisc::impute(df.main$gini.house.hold.income,mean)
# [debug] : checking for na values
# colSums(is.na(df.main))

# Box-whisker plot to see life ladder's mean and IQR values per year
ggplot2::ggplot(df.main, aes(x = year, y = life.ladder, group = year)) + ggplot2::geom_boxplot()

ggplot2::ggplot(df.main, aes(x = year, y = log.gdp.per.cap, group = year)) + ggplot2::geom_boxplot()

# to see if it's somewhere near normal distribution
ggplot2::ggplot(df.main, aes(life.ladder)) +
  ggplot2::geom_histogram(aes(y = ..density..),
                          bins = 20,
                          col = 4,
                          fill = "blue", 
                          alpha = 0.4) +
  ggplot2::geom_density(col = 6) +
  ggplot2::labs(x = "life.ladder", y = "count")

# Split into train and test data
df.train <- df.main %>% dplyr::filter(year >= 2005 & year <= 2014)
df.test <- df.main %>% dplyr::filter(year > 2014 & year < 2017)

# Reduce each set to means
df.train.mean <- df.train %>% 
  dplyr::group_by(country) %>%
  dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>% 
  dplyr::select(-country)
df.test.mean <- df.test %>% 
  dplyr::group_by(country) %>%
  dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>% 
  dplyr::select(-country)

EDA / Data Visualization

# extracting required data for data visualization
df.train.map <- df.train %>% 
  dplyr::group_by(country) %>%
  dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>% 
  dplyr::select(-year)

# data is from highcharter map library, the grouping for continents were off, but wasn't a problem in overall EDA
af.data <- list(region='Africa',
                country=(get_data_from_map(download_map_data('custom/africa')))$name)
asia.data <- list(region='Asia',
                  country=(get_data_from_map(download_map_data('custom/asia')))$name)
eu.data <- list(region='Europe',
                country=(get_data_from_map(download_map_data('custom/europe')))$name)
na.data <- list(region='North America',
                country=(get_data_from_map(download_map_data('custom/north-america')))$name)
oc.data <- list(region='Oceania',
                country=(get_data_from_map(download_map_data('custom/oceania')))$name)
sa.data <- list(region='South America',
                country=(get_data_from_map(download_map_data('custom/south-america')))$name)

world.data <- Reduce(function(x, y) merge(x, y, all=TRUE), 
                     list(af.data, asia.data, eu.data, na.data, oc.data, sa.data))

# values that are different were brute force searched and then substituted
# missing.countries <- (df.train.map[!(df.train.map$country %in% map.data$name),])[,1]
# excluded.countries <- (map.data[!(map.data$name %in% df.train.map$country),])$name
df.train.map[df.train.map$country == "Congo (Brazzaville)",1] = 'Democratic Republic of the Congo'
df.train.map[df.train.map$country == "Congo (Kinshasa)",1] = 'Republic of Congo'
df.train.map[df.train.map$country == "North Cyprus",1] = 'Northern Cyprus'
df.train.map[df.train.map$country == "Serbia",1] = 'Republic of Serbia'
df.train.map[df.train.map$country == "Somaliland region",1] = 'Somaliland'
df.train.map[df.train.map$country == "Taiwan Province of China",1] = 'Taiwan'
df.train.map[df.train.map$country == "Tanzania",1] = 'United Republic of Tanzania'
df.train.map[df.train.map$country == "United States",1] = 'United States of America'

# adding region column according to country
df.final <- merge(df.train.map, world.data, by='country', all.x=TRUE)
df.final$region <- as.character(df.final$region)

# adding mean life ladder value according to region
region.happiness <- c()
for(i in unique(df.final$region)) {
  summary <- df.final[df.final$region == i,]
  region.happiness <- c(region.happiness,round(mean(summary$life.ladder,na.rm=TRUE),digits=3))
}
df.outer <- (data_frame(name = unique(df.final$region),
                       y = round(region.happiness,digits=3),
                       drilldown=tolower(unique(df.final$region))))[1:6,]
df.outer <- df.outer[order(df.outer$y, decreasing=TRUE),]

# base bargraph (outer bargraph)
hc <- highchart() %>% 
  hc_chart(type='column') %>% 
  hc_title(text='Drilldown by Life Ladder for Regions') %>% 
  hc_xAxis(type='category') %>% 
  hc_legend(enabled=FALSE) %>% 
  hc_plotOptions(series=list(borderWidth=0,dataLabels=list(enabled=TRUE))) %>%
  hc_add_series(data=df.outer, name="Life Ladder",colorByPoint=TRUE)

# in order to add into drilldown, each data must be separated by continent and all country
# and life.ladder values are intact
africa <- df.final[df.final$region == 'Africa',][c('country','life.ladder')]
colnames(africa) <- c('name','value')
africa <- africa[!is.na(africa$name),]
africa <- africa[order(africa$value,decreasing=TRUE),]
africa$value <- round(africa$value, digits=3)
asia <- df.final[df.final$region == 'Asia',][c('country','life.ladder')]
colnames(asia) <- c('name','value')
asia <- asia[!is.na(asia$name),]
asia <- asia[order(asia$value,decreasing=TRUE),]
asia$value <- round(asia$value, digits=3)
europe <- df.final[df.final$region == 'Europe',][c('country','life.ladder')]
colnames(europe) <- c('name','value')
europe <- europe[!is.na(europe$name),]
europe <- europe[order(europe$value,decreasing=TRUE),]
europe$value <- round(europe$value, digits=3)
north.america <- df.final[df.final$region == 'North America',][c('country','life.ladder')]
colnames(north.america) <- c('name','value')
north.america <- north.america[!is.na(north.america$name),]
north.america <- north.america[order(north.america$value,decreasing=TRUE),]
north.america$value <- round(north.america$value, digits=3)
oceania <- df.final[df.final$region == 'Oceania',][c('country','life.ladder')]
colnames(oceania) <- c('name','value')
oceania <- oceania[!is.na(oceania$name),]
oceania <- oceania[order(oceania$value,decreasing=TRUE),]
oceania$value <- round(oceania$value, digits=3)
south.america <- df.final[df.final$region == 'South America',][c('country','life.ladder')]
colnames(south.america) <- c('name','value')
south.america <- south.america[!is.na(south.america$name),]
south.america <- south.america[order(south.america$value,decreasing=TRUE),]
south.america$value <- round(south.america$value, digits=3)

# adding drilldown
hc %>%
  hc_drilldown(allowPointDrilldown = TRUE,
               series = list(
                 list(id = 'africa', data = list_parse2(africa)),
                 list(id = 'asia', data = list_parse2(asia)),
                 list(id = 'europe', data = list_parse2(europe)),
                 list(id = 'north america', data = list_parse2(north.america)),
                 list(id = 'oceania', data = list_parse2(oceania)),
                 list(id = 'south america', data = list_parse2(south.america))
                 )
               )
# due to too many maps, other variables have been exlcuded from this report
# for other map visualization on other variables that's not life.ladder, see map-visualization.r file
map.life.ladder <- hcmap("custom/world-robinson-highres", data = df.train.map, value = "life.ladder",
                         joinBy = c('name','country'), name = 'Life Ladder',
                         dataLabels = list(enabled = TRUE, format = '{point.name}'),
                         borderColor = "#000000", borderWidth = 0.1,
                         tooltip = list(valueDecimals = 3),
                         nullColor = '#FF0000') %>% 
                         hc_title(text = 'Life Ladder World Map')
map.life.ladder

full model: 모든 변수를 다 포함한 모형. 이중 통계적으로 유의미하지 않은 변수를 빼준다.

# Initial Linear Model
lm.train.mean <- lm(life.ladder ~. -year, data = df.train.mean)
summary(lm.train.mean)
## 
## Call:
## lm(formula = life.ladder ~ . - year, data = df.train.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30833 -0.24882  0.00648  0.26352  0.99401 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -0.664166   0.781652  -0.850  0.39686    
## log.gdp.per.cap                   0.301018   0.065862   4.570 1.01e-05 ***
## social.support                    1.601548   0.512657   3.124  0.00214 ** 
## healthy.life.expectancy.at.birth  0.007102   0.009833   0.722  0.47128    
## freedom.to.make.life.choices      0.743791   0.435332   1.709  0.08961 .  
## generosity                        0.165156   0.310855   0.531  0.59600    
## perception.of.corruption         -0.981273   0.308673  -3.179  0.00180 ** 
## positive.affect                   3.440644   0.689705   4.989 1.67e-06 ***
## negative.affect                   1.148848   0.656743   1.749  0.08230 .  
## confidence.in.govt               -0.856688   0.309801  -2.765  0.00641 ** 
## democratic.quality               -0.032719   0.096206  -0.340  0.73427    
## delivery.quality                  0.092323   0.110959   0.832  0.40671    
## gini.index                        1.050375   1.353571   0.776  0.43898    
## gini.index.past.mean             -1.185610   0.985106  -1.204  0.23068    
## gini.house.hold.income           -1.012306   0.497685  -2.034  0.04372 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4521 on 149 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8261 
## F-statistic: 56.33 on 14 and 149 DF,  p-value: < 2.2e-16
gvlma::gvlma(lm.train.mean)
## 
## Call:
## lm(formula = life.ladder ~ . - year, data = df.train.mean)
## 
## Coefficients:
##                      (Intercept)                   log.gdp.per.cap  
##                        -0.664166                          0.301018  
##                   social.support  healthy.life.expectancy.at.birth  
##                         1.601548                          0.007102  
##     freedom.to.make.life.choices                        generosity  
##                         0.743791                          0.165156  
##         perception.of.corruption                   positive.affect  
##                        -0.981273                          3.440644  
##                  negative.affect                confidence.in.govt  
##                         1.148848                         -0.856688  
##               democratic.quality                  delivery.quality  
##                        -0.032719                          0.092323  
##                       gini.index              gini.index.past.mean  
##                         1.050375                         -1.185610  
##           gini.house.hold.income  
##                        -1.012306  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = lm.train.mean) 
## 
##                      Value   p-value                   Decision
## Global Stat        25.9816 3.192e-05 Assumptions NOT satisfied!
## Skewness            2.1188 1.455e-01    Assumptions acceptable.
## Kurtosis            0.5836 4.449e-01    Assumptions acceptable.
## Link Function      23.0574 1.572e-06 Assumptions NOT satisfied!
## Heteroscedasticity  0.2219 6.376e-01    Assumptions acceptable.
# Selecting independent variable that have relation to the dependent variable (stepwise)
step(lm.train.mean, df.train.mean, direction = "both")
## Start:  AIC=-246.15
## life.ladder ~ (year + log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
##     freedom.to.make.life.choices + generosity + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     democratic.quality + delivery.quality + gini.index + gini.index.past.mean + 
##     gini.house.hold.income) - year
## 
##                                    Df Sum of Sq    RSS     AIC
## - democratic.quality                1    0.0236 30.472 -248.02
## - generosity                        1    0.0577 30.506 -247.84
## - healthy.life.expectancy.at.birth  1    0.1066 30.555 -247.57
## - gini.index                        1    0.1231 30.572 -247.49
## - delivery.quality                  1    0.1415 30.590 -247.39
## - gini.index.past.mean              1    0.2960 30.745 -246.56
## <none>                                          30.449 -246.15
## - freedom.to.make.life.choices      1    0.5965 31.045 -244.96
## - negative.affect                   1    0.6253 31.074 -244.81
## - gini.house.hold.income            1    0.8455 31.294 -243.66
## - confidence.in.govt                1    1.5627 32.011 -239.94
## - social.support                    1    1.9944 32.443 -237.74
## - perception.of.corruption          1    2.0652 32.514 -237.38
## - log.gdp.per.cap                   1    4.2687 34.718 -226.63
## - positive.affect                   1    5.0855 35.534 -222.82
## 
## Step:  AIC=-248.02
## life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
##     freedom.to.make.life.choices + generosity + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     delivery.quality + gini.index + gini.index.past.mean + gini.house.hold.income
## 
##                                    Df Sum of Sq    RSS     AIC
## - generosity                        1    0.0684 30.541 -249.65
## - healthy.life.expectancy.at.birth  1    0.1109 30.583 -249.42
## - gini.index                        1    0.1363 30.609 -249.29
## - delivery.quality                  1    0.1693 30.642 -249.11
## - gini.index.past.mean              1    0.3183 30.791 -248.31
## <none>                                          30.472 -248.02
## - freedom.to.make.life.choices      1    0.5755 31.048 -246.95
## - negative.affect                   1    0.6252 31.098 -246.69
## - gini.house.hold.income            1    0.8242 31.297 -245.64
## - confidence.in.govt                1    1.5403 32.013 -241.93
## - social.support                    1    1.9713 32.444 -239.74
## - perception.of.corruption          1    2.3266 32.799 -237.95
## - log.gdp.per.cap                   1    4.6013 35.074 -226.96
## - positive.affect                   1    5.0620 35.534 -224.82
## 
## Step:  AIC=-249.65
## life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     delivery.quality + gini.index + gini.index.past.mean + gini.house.hold.income
## 
##                                    Df Sum of Sq    RSS     AIC
## - healthy.life.expectancy.at.birth  1    0.1125 30.653 -251.05
## - gini.index                        1    0.1429 30.684 -250.89
## - delivery.quality                  1    0.2013 30.742 -250.57
## <none>                                          30.541 -249.65
## - gini.index.past.mean              1    0.4190 30.960 -249.42
## - freedom.to.make.life.choices      1    0.5840 31.125 -248.54
## - negative.affect                   1    0.7559 31.297 -247.64
## - gini.house.hold.income            1    0.7833 31.324 -247.50
## - confidence.in.govt                1    1.4847 32.026 -243.87
## - social.support                    1    1.9574 32.498 -241.46
## - perception.of.corruption          1    2.4197 32.960 -239.15
## - log.gdp.per.cap                   1    4.6060 35.147 -228.61
## - positive.affect                   1    6.8724 37.413 -218.37
## 
## Step:  AIC=-251.05
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + delivery.quality + gini.index + gini.index.past.mean + 
##     gini.house.hold.income
## 
##                                Df Sum of Sq    RSS     AIC
## - gini.index                    1    0.2970 30.950 -251.47
## <none>                                      30.653 -251.05
## - delivery.quality              1    0.3847 31.038 -251.00
## - freedom.to.make.life.choices  1    0.5679 31.221 -250.04
## - gini.index.past.mean          1    0.7254 31.379 -249.21
## - negative.affect               1    0.8567 31.510 -248.53
## - gini.house.hold.income        1    1.0249 31.678 -247.66
## - confidence.in.govt            1    1.5553 32.209 -244.93
## - social.support                1    2.0424 32.696 -242.47
## - perception.of.corruption      1    2.4603 33.114 -240.39
## - log.gdp.per.cap               1    6.7900 37.443 -220.23
## - positive.affect               1    7.1994 37.853 -218.45
## 
## Step:  AIC=-251.47
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + delivery.quality + gini.index.past.mean + 
##     gini.house.hold.income
## 
##                                Df Sum of Sq    RSS     AIC
## - delivery.quality              1    0.2975 31.248 -251.90
## <none>                                      30.950 -251.47
## - gini.index.past.mean          1    0.4405 31.391 -251.15
## - freedom.to.make.life.choices  1    0.5084 31.459 -250.79
## - negative.affect               1    1.1664 32.117 -247.40
## - gini.house.hold.income        1    1.4421 32.392 -246.00
## - confidence.in.govt            1    1.5854 32.536 -245.27
## - social.support                1    1.9594 32.910 -243.40
## - perception.of.corruption      1    2.6472 33.598 -240.01
## - log.gdp.per.cap               1    6.7913 37.742 -220.93
## - positive.affect               1    8.4208 39.371 -214.00
## 
## Step:  AIC=-251.9
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + gini.index.past.mean + gini.house.hold.income
## 
##                                Df Sum of Sq    RSS     AIC
## <none>                                      31.248 -251.90
## - gini.index.past.mean          1    0.5679 31.816 -250.94
## - freedom.to.make.life.choices  1    0.7175 31.965 -250.18
## - negative.affect               1    1.0945 32.342 -248.25
## - gini.house.hold.income        1    1.3335 32.581 -247.04
## - confidence.in.govt            1    1.8491 33.097 -244.47
## - social.support                1    2.1096 33.357 -243.18
## - perception.of.corruption      1    3.2637 34.512 -237.61
## - positive.affect               1    8.4022 39.650 -214.84
## - log.gdp.per.cap               1   10.6645 41.912 -205.74
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     gini.index.past.mean + gini.house.hold.income, data = df.train.mean)
## 
## Coefficients:
##                  (Intercept)               log.gdp.per.cap  
##                      -0.5445                        0.3502  
##               social.support  freedom.to.make.life.choices  
##                       1.6200                        0.7872  
##     perception.of.corruption               positive.affect  
##                      -1.1493                        3.8305  
##              negative.affect            confidence.in.govt  
##                       1.4255                       -0.9072  
##         gini.index.past.mean        gini.house.hold.income  
##                      -0.9835                       -1.1608

second model: AIC를 고려한 후의 회귀모형

# Build new model
lm.train.step <- lm(life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
                   freedom.to.make.life.choices + perception.of.corruption + 
                   positive.affect + negative.affect + confidence.in.govt + 
                   gini.index.past.mean + gini.house.hold.income, data = df.train.mean)
summary(lm.train.step)
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     healthy.life.expectancy.at.birth + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + gini.index.past.mean + gini.house.hold.income, 
##     data = df.train.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.28765 -0.22980  0.03101  0.25587  1.10491 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -0.892713   0.696774  -1.281 0.202059    
## log.gdp.per.cap                   0.300777   0.058718   5.122 8.96e-07 ***
## social.support                    1.565495   0.501918   3.119 0.002169 ** 
## healthy.life.expectancy.at.birth  0.012622   0.008594   1.469 0.143976    
## freedom.to.make.life.choices      0.782478   0.417068   1.876 0.062541 .  
## perception.of.corruption         -1.089667   0.288366  -3.779 0.000225 ***
## positive.affect                   3.655918   0.604839   6.044 1.10e-08 ***
## negative.affect                   1.252073   0.622795   2.010 0.046145 *  
## confidence.in.govt               -0.853102   0.301644  -2.828 0.005309 ** 
## gini.index.past.mean             -0.789266   0.600414  -1.315 0.190633    
## gini.house.hold.income           -0.944341   0.474583  -1.990 0.048391 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4488 on 153 degrees of freedom
## Multiple R-squared:  0.8392, Adjusted R-squared:  0.8287 
## F-statistic: 79.84 on 10 and 153 DF,  p-value: < 2.2e-16
gvlma::gvlma(lm.train.step)
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     healthy.life.expectancy.at.birth + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + gini.index.past.mean + gini.house.hold.income, 
##     data = df.train.mean)
## 
## Coefficients:
##                      (Intercept)                   log.gdp.per.cap  
##                         -0.89271                           0.30078  
##                   social.support  healthy.life.expectancy.at.birth  
##                          1.56550                           0.01262  
##     freedom.to.make.life.choices          perception.of.corruption  
##                          0.78248                          -1.08967  
##                  positive.affect                   negative.affect  
##                          3.65592                           1.25207  
##               confidence.in.govt              gini.index.past.mean  
##                         -0.85310                          -0.78927  
##           gini.house.hold.income  
##                         -0.94434  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = lm.train.step) 
## 
##                      Value   p-value                   Decision
## Global Stat        24.7157 5.739e-05 Assumptions NOT satisfied!
## Skewness            1.3049 2.533e-01    Assumptions acceptable.
## Kurtosis            0.5750 4.483e-01    Assumptions acceptable.
## Link Function      22.3620 2.258e-06 Assumptions NOT satisfied!
## Heteroscedasticity  0.4737 4.913e-01    Assumptions acceptable.

통계적으로 유의미하지 않은 변수를 빼준 후의 회귀모형

# Removing variables with little impact (healthy.life, gini.index.past.mean)
lm.train.step2 <- lm(life.ladder ~ log.gdp.per.cap + social.support +
                    freedom.to.make.life.choices + perception.of.corruption + 
                    positive.affect + negative.affect + confidence.in.govt + gini.house.hold.income, 
                    data = df.train.mean)
summary(lm.train.step2)
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     gini.house.hold.income, data = df.train.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24193 -0.25445  0.02696  0.28740  1.05503 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -0.58285    0.66106  -0.882 0.379308    
## log.gdp.per.cap               0.36480    0.04778   7.634 2.17e-12 ***
## social.support                1.58478    0.50488   3.139 0.002031 ** 
## freedom.to.make.life.choices  0.87281    0.41788   2.089 0.038374 *  
## perception.of.corruption     -1.23015    0.28409  -4.330 2.66e-05 ***
## positive.affect               3.45735    0.55509   6.229 4.23e-09 ***
## negative.affect               1.28225    0.61130   2.098 0.037566 *  
## confidence.in.govt           -0.94773    0.30126  -3.146 0.001986 ** 
## gini.house.hold.income       -1.44787    0.42146  -3.435 0.000759 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4531 on 155 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8254 
## F-statistic:  97.3 on 8 and 155 DF,  p-value: < 2.2e-16
gvlma(lm.train.step2)
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     gini.house.hold.income, data = df.train.mean)
## 
## Coefficients:
##                  (Intercept)               log.gdp.per.cap  
##                      -0.5828                        0.3648  
##               social.support  freedom.to.make.life.choices  
##                       1.5848                        0.8728  
##     perception.of.corruption               positive.affect  
##                      -1.2302                        3.4574  
##              negative.affect            confidence.in.govt  
##                       1.2822                       -0.9477  
##       gini.house.hold.income  
##                      -1.4479  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm.train.step2) 
## 
##                        Value   p-value                   Decision
## Global Stat        27.125424 1.875e-05 Assumptions NOT satisfied!
## Skewness            1.132482 2.872e-01    Assumptions acceptable.
## Kurtosis            0.005501 9.409e-01    Assumptions acceptable.
## Link Function      24.886980 6.079e-07 Assumptions NOT satisfied!
## Heteroscedasticity  1.100461 2.942e-01    Assumptions acceptable.

모델 선정: glm Time Series 모델과 비교

# Removing healthy.life.expectany.at.birth, gini.index.past.mean, generosity columns from dataset
df.train.mean2 <- df.train.mean %>% 
  dplyr::select(-c(healthy.life.expectancy.at.birth, gini.index.past.mean, generosity))

# PQL (Penalized Quasi-Likelihood) <<< this is for one of the models from time series
# Random Intercept (random = ~1|year)
glm.train.mean2 <- MASS::glmmPQL(life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + 
                     perception.of.corruption + positive.affect + negative.affect + 
                     confidence.in.govt + gini.house.hold.income, ~1 | year, family = Gamma,
                     data = df.test.mean, verbose = FALSE)
summary(glm.train.mean2)
## Linear mixed-effects model fit by maximum likelihood
##  Data: df.test.mean 
##   AIC BIC logLik
##    NA  NA     NA
## 
## Random effects:
##  Formula: ~1 | year
##          (Intercept)   Residual
## StdDev: 2.669923e-07 0.09230614
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~invwt 
## Fixed effects: life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +      perception.of.corruption + positive.affect + negative.affect +      confidence.in.govt + gini.house.hold.income 
##                                   Value   Std.Error  DF   t-value p-value
## (Intercept)                   0.4684654 0.029136930 137 16.078063  0.0000
## log.gdp.per.cap              -0.0145448 0.001996504 137 -7.285157  0.0000
## social.support               -0.0975307 0.022178397 137 -4.397555  0.0000
## freedom.to.make.life.choices -0.0637479 0.019253580 137 -3.310966  0.0012
## perception.of.corruption      0.0125634 0.009933901 137  1.264701  0.2081
## positive.affect              -0.0679222 0.021237152 137 -3.198273  0.0017
## negative.affect              -0.0365868 0.024105853 137 -1.517756  0.1314
## confidence.in.govt            0.0326451 0.010574072 137  3.087282  0.0024
## gini.house.hold.income        0.0419657 0.017488778 137  2.399581  0.0178
##  Correlation: 
##                              (Intr) lg.g.. scl.sp fr.... prcp.. pstv.f
## log.gdp.per.cap              -0.477                                   
## social.support               -0.321 -0.438                            
## freedom.to.make.life.choices  0.034 -0.200 -0.143                     
## perception.of.corruption     -0.566  0.384 -0.072  0.181              
## positive.affect              -0.209  0.074 -0.207 -0.487 -0.037       
## negative.affect              -0.328 -0.143  0.355 -0.045 -0.250  0.196
## confidence.in.govt           -0.423  0.287  0.166 -0.388  0.379  0.092
## gini.house.hold.income       -0.382  0.187  0.253 -0.137  0.127 -0.150
##                              ngtv.f cnfd..
## log.gdp.per.cap                           
## social.support                            
## freedom.to.make.life.choices              
## perception.of.corruption                  
## positive.affect                           
## negative.affect                           
## confidence.in.govt            0.060       
## gini.house.hold.income       -0.219  0.012
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -2.7297312 -0.5848387 -0.0577856  0.5345243  2.8768158 
## 
## Number of Observations: 148
## Number of Groups: 3
r2glmm::r2beta(glm.train.mean2)
##                         Effect   Rsq upper.CL lower.CL
## 1                        Model 0.796    0.841    0.751
## 2              log.gdp.per.cap 0.249    0.366    0.143
## 3               social.support 0.125    0.233    0.044
## 4 freedom.to.make.life.choices 0.066    0.159    0.010
## 6              positive.affect 0.059    0.150    0.008
## 8           confidence.in.govt 0.052    0.140    0.005
## 9       gini.house.hold.income 0.033    0.111    0.001
## 7              negative.affect 0.015    0.077    0.000
## 5     perception.of.corruption 0.008    0.061    0.000

모델 비교

anova(lm.train.mean, lm.train.step, lm.train.step2)
## Analysis of Variance Table
## 
## Model 1: life.ladder ~ (year + log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
##     freedom.to.make.life.choices + generosity + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     democratic.quality + delivery.quality + gini.index + gini.index.past.mean + 
##     gini.house.hold.income) - year
## Model 2: life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     gini.index.past.mean + gini.house.hold.income
## Model 3: life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + 
##     perception.of.corruption + positive.affect + negative.affect + 
##     confidence.in.govt + gini.house.hold.income
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    149 30.449                              
## 2    153 30.813 -4  -0.36465 0.4461 0.77511  
## 3    155 31.816 -2  -1.00227 2.4523 0.08956 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

이상치 및 다중공선성(Outliers and Multicollinearlity) Test

# testing if there are outliers that are influencing the R^2 value
car::outlierTest(lm.train.step2, data = df.train.mean)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferonni p
## 137 -2.896202          0.0043272      0.70965
par(mfrow = c(2, 2))
plot(lm.train.step2)

par(mfrow = c(1,1))
# Conclusion : Sri Lanka is an outlier, but not omitted, due to being 1 value out of many

# Correlation plot (visually checking for multicollinearity)
lm.corr <- cor(df.train.mean2[3:ncol(df.train.mean2)])
corrplot::corrplot(lm.corr, method = "number")

최종모델 해석

1단계 : 회귀모형은 타당한가? 귀무가설 : 회귀모형은 타당하지 않다. 대립가설 : 회귀모형은 타당하다. F-statistic: 98.83 on 8 and 153 DF, p-value: < 2.2e-16 결론 : 유의확률이 0.000이므로 유의수준 0.05에서 회귀모형은 통계적으로 타당하다.

2단계 : 독립변수는 종속변수에게 영향을 주는가? (조건 : 1단계의 결론이 대립가설이어야 함. 이 회귀모형의 경우 대립가설임.) 귀무가설 : 독립변수는 종속변수에게 영향을 주지 않는다(beta1 = 0). 대립가설 : 독립변수는 종속변수에게 영향을 준다(beta1 not equal 0).

                             Estimate    Std. Error t value Pr(>|t|)  
log.gdp.per.cap               0.37435    0.04702   7.961 3.59e-13 ***
social.support                1.40026    0.52229   2.681 0.008145 **
freedom.to.make.life.choices  0.98169    0.45543   2.156 0.032684 *
perception.of.corruption     -1.20868    0.28250  -4.278 3.30e-05 ***
positive.affect               3.44207    0.55740   6.175 5.69e-09 ***
negative.affect               1.17269    0.60926   1.925 0.056113 .
confidence.in.govt           -0.99422    0.30707  -3.238 0.001477 **
gini.house.hold.income       -1.54129    0.41282  -3.734 0.000266 ***

결론 : 유의확률이 0.000이므로 유의수준 0.05에서 모형에 포함된 독립변수들은 종속변수에게 모두 통계적으로 유의한 영향을 주는 것으로 나타났다. 즉 log.gdp.per.cap, freedom.to.make.life.choices, generosity, perception.of.corruption, confidence.in.govt는 모두 life ladder (행복지수) 에 통계적으로 유의한 영향을 준다.

** 3단계 : 독립변수는 어떠한 영향을 주는가? ** (조건 : 2단계의 결론이 대립가설이어야 함. 이 회귀모형의 경우대립가설임.)

                             Estimate
log.gdp.per.cap               0.37435
social.support                1.40026
freedom.to.make.life.choices  0.98169
perception.of.corruption     -1.20868
positive.affect               3.44207
negative.affect               1.17269
confidence.in.govt           -0.99422
gini.house.hold.income       -1.54129

(다른 독립변수가 모두 고정되어 있을때!!!)

  1. 독립변수인 log.gdp.per.cap 의 기본단위가 1 증가하면, 종속변수인 ladder는 약 0.37435 정도 증가한다.
  2. 독립변수인 social.support가 1 증가하면, 종속변수인 ladder는 약 1.40026 증가한다.
  3. 독립변수인 freedom.to.make.life.choices가 1 증가하면, 종속변수인 ladder는 약 0.98169 증가한다.
  4. 독립변수인 perception.of.corruption가 1 증가하면, 종속변수인 ladder는 약 1.20868 감소한다.
  5. 독립변수인 positive.affect가 1 증가하면, 종속변수인 ladder는 약 3.44207 증가한다.
  6. 독립변수인 negative.affect가 1 증가하면, 종속변수인 ladder는 약 1.17269 증가한다.
  7. 독립변수인 confidence.in.govt 가 1 증가하면, 종속변수인 ladder는 약 0.99422 감소한다.
  8. 독립변수인 gini.house.hold.income가 1 증가하면, 종속변수인 ladder는 약 1.54129 감소한다.

** 4단계 : 회귀모형의 설명력 = 독립변수의 설명력 ** Multiple R-squared: 0.8379, R^2 = SSR / SST : 결정계수(Coefficient of Determination) = 0.8379 * 회귀모형의 설명력은 약 83.79% * 또는 독립변수들의 설명력은 약 83.79% * 종속변수의 다름이 100%라고 하면 독립변수가 종속변수의 다름을 약 83.79% 설명하고 있다(독립변수 때문에 종속변수의 다름이 약 83.79% 발생했다)

** 세부분석 ** # 다중공선성(Multicolinearity) : 독립변수들 간의 상관관계를 파악함 VIF(Variation Inflation Factor) = 분산팽창인자 = 분산팽창요인 VIF 값이 10 이상이면 독립변수들 간에 다중공선성(선형관계)이 존재한다고 파악 -> 회귀분석 결과 해석 못 함.

car::vif(lm.train.step2)
##              log.gdp.per.cap               social.support 
##                     2.638470                     2.724857 
## freedom.to.make.life.choices     perception.of.corruption 
##                     2.588469                     1.758110 
##              positive.affect              negative.affect 
##                     2.420524                     1.476022 
##           confidence.in.govt       gini.house.hold.income 
##                     1.670271                     1.417159

3개의 VIF 값이 10 이상인 값이 하나도 없으므로 독립변수들 간에 다중공선성은 존재하지 않는다.

영향력이 가장 큰 독립변수는? 독립변수의 단위가 모두 같다면, 회귀계수의 절대값이 가장 큰 것이 가장 영향력이 크다. 독립변수의 단위가 다르면, 표준화된 회귀계수의 절대값이 가장 큰 것이 가장 영향력이 크다. lm.beta::lm.beta(lm.result) 를 사용해서 표준화된 회귀계수 값을 구한다. lm.beta는 표준화된 회귀계수를 구해준다.

lm.beta::lm.beta(lm.train.step2)
## 
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support + 
##     freedom.to.make.life.choices + perception.of.corruption + 
##     positive.affect + negative.affect + confidence.in.govt + 
##     gini.house.hold.income, data = df.train.mean)
## 
## Standardized Coefficients::
##                  (Intercept)              log.gdp.per.cap 
##                   0.00000000                   0.40589351 
##               social.support freedom.to.make.life.choices 
##                   0.16959329                   0.10998771 
##     perception.of.corruption              positive.affect 
##                  -0.18792301                   0.31717351 
##              negative.affect           confidence.in.govt 
##                   0.08341066                  -0.13307210 
##       gini.house.hold.income 
##                  -0.13385792
             (Intercept)              log.gdp.per.cap               social.support freedom.to.make.life.choices 
              0.00000000                   0.41760768                   0.16409732                   0.10531958 
perception.of.corruption              positive.affect              negative.affect           confidence.in.govt 
             -0.18943523                   0.31779551                   0.08004477                  -0.13943904 
  gini.house.hold.income 
             -0.13148714                   

결론 : life.ladder (행복지수)에 log.gdp.per.cap (beta = 0.5726012)가 가장 영향력이 크고, 그 다음으로 positive.affect(beta=0.31779551), perception.of.corruption (beta = -0.189435), social support (beta=0.16409), confidnece in government (beta = -0.13943904), gini.house.hold.income (beta = -0.13148), freedom.to.make.life.choices (beta=0.1053), negative.affect (beta=0.0800), generosity (beta=0.0615), 순서로 나타났다.

잔차 분석(Residual Analysis)

회귀분석을 위한 가정(Assumption)들은 만족하는지를 확인

(1) 정규성(Normality)

종속변수가 정규분포를 따른다면, 잔차(residual value) 또한 정규분포를 따르며, 평균은 0 이다. i. Normal Q-Q plot : x축은 잔차의 사분위수(Quartiles), y축은 잔차의 표준화. 정규성 가정을 만족한다면 Normal Q-Q plot는 직선에 가까운 그래프가 되어야 한다. 직선을 벗어나는 점들이 많으면 정규성 가정이 깨질 가능성이 높다.

qqnorm(lm.train.step2$residuals)
qqline(lm.train.step2$residuals)

# Testing for autocorrelation of errors
olsrr::ols_test_correlation(lm.train.step2) # 0.995226
## [1] 0.996192
  1. Shapiro-Wilk test: 잔차에 대해서 정규성 검정을 한다. 잔차는 정규성 검정을 통과한다: p-value는 0.4368, 즉 귀무가설: 잔차는 정규분포이다.
# method 1: Testing if residuals are normally distributed
shapiro.test(lm.train.step2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lm.train.step2$residuals
## W = 0.99148, p-value = 0.4389
# method 2: Testing if residuals are normally distributed
# olsrr::ols_test_normality(lm.train.step2)
qqp(lm.train.step2, data = df.train.mean2)

## [1]  60 137

(2) 등분산성 (Heteroscedasticity)

x축 : 독립변수 또는 종속변수의 추정치 y축 : 잔차 또는 표준화 잔차 분산이 일정하다는 가정을 만족한다면 잔차의 값이 0을 기준으로 위/아래로 랜덤random)하게 분포되어 있어야 한다. 일정한 패턴을 가지면 등분산성은 깨진다.

p-value가 0.7112, 즉 0.05보다 크다. 귀무가설: 잔차의 값이 0을 기준으로 위/아래로 랜덤random)하게 분포되어 있다. 등분산성 조건 만족.

# Check for homoscedasticity
car::ncvTest(lm.train.step2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.204514    Df = 1     p = 0.6511016
spreadLevelPlot(lm.train.step2)

## 
## Suggested power transformation:  1.349126
# Conclusion : No change required

(3) 독립성(Independence)

종속변수는 서로 독립적이어야 한다. 관찰값들 사이에 상관관계가 있을 경우, 오차항들이 서로 독립이라는 조건을 검토해 보아야 한다. 만일 오차항들이 서로 독립이라면, 잔차들은 무작위로 흩어져 있을 것이고, 그렇지 않다면 무작위로 흩어져 있지 않아 오차항들 사이에 상관관계가 있다고 할 수 있을 것이다. i. 더빈-왓슨 통계량(Durbin-Watson Statistics) 오차항의 독립성을 평가하는 한 측도로 더빈-왓슨(Durbin-Watson) 통계량이 있다. DW 통계량은 잔차들의 상관계수를 측정함 DW 통계량이 2에 가까우면 인접 오차항들 사이에 상관관계가 없는 것을 의미하며, DW 통계량이 4에 가까우면 음의 상관관계가 있고 DW 통계량이 0에 가까우면 양의 상관관계가 있는 것으로 평가한다. 귀무가설 : 각 오차들은 서로 독립이다. 대립가설 : 각 오차들은 서로 독립이 아니다.

p-value가 0.4426, 즉 0.05보다 크다. 귀무가설: 각 오차들은 서로 독립이다. 독립성 조건 만족.

lmtest::dwtest(lm.train.step2)
## 
##  Durbin-Watson test
## 
## data:  lm.train.step2
## DW = 1.991, p-value = 0.4785
## alternative hypothesis: true autocorrelation is greater than 0

(4) 선형성(Linearity)

종속변수와 독립변수가 선형관계에 있다면 잔차와 독립변수 사이에 어떤 체계적인 관계가 있으면 안 된다.

par(mfrow = c(2, 2))
plot(lm.train.step2)

5단계 : 회귀모형을 이용한 예측

                             Estimate
log.gdp.per.cap               0.37435
social.support                1.40026
freedom.to.make.life.choices  0.98169
perception.of.corruption     -1.20868
positive.affect               3.44207
negative.affect               1.17269
confidence.in.govt           -0.99422
gini.house.hold.income       -1.54129

life.ladder = (0.37435 log.gdp.per.cap) + (0.98169 freedom.to.make.life.choices) + (0.43866 generosity) + (-1.20868 perception.of.corruption) + (-0.99422 confidence.in.govt) + (1.40026 social.support) + (3.44207 positive.affect) + (-1.54129 gini.house.hold.income) + (1.17269 negative.affect) + (1.40026 social.support)

Test Data 예측 확인결과

NA값 제외 후 총 Data Point : 148 예측안에 들어가는 결과 : 148 결론 : 모든 Test Data는 Train Data로 만들어진 Final Linear Model 예측안에 들어간다

# matching train colnames to test colnames
train.colnames <- colnames(df.train.mean2)
test.colnames <- colnames(df.test.mean)
predict.colnames <- test.colnames[train.colnames %in% test.colnames]

# creating the upper and lower bound for the predicting data
pred_probs <- predict(lm.train.step2, newdata = df.test.mean[, predict.colnames], interval = "predict", type = "response")

# testing if it goes inside the boundaries (1 if it does, 0 if it doesn't)
dt.pred_prob <- as.data.table(pred_probs)
dt.pred_prob$result <- ifelse((dt.pred_prob$lwr <= dt.pred_prob$fit) & (dt.pred_prob$upr >= dt.pred_prob$fit), 1, 0)
table(dt.pred_prob$result)
## 
##   1 
## 148
# MSE of our model
mean(lm.train.step2$residuals^2)
## [1] 0.1939984